Introduction

One of the biggest issues the world is currently facing is the rising numbers of terrorist attacks. Most people hear via the news or on the internet but we never really spectulate on the statistics behind terrorism. In this tutorial we will explore a dataset related to terrostic attacks to demonstrate the process of the data science pipeline.

Getting Started

The dataset we will be using is from the Global Terrorism Database, which has data available from 1970-2016 which is alos available here. The purpose for choosing this data set is that it has many attributes to describe each entity, which is represented as an act of terrorism. From the data we can start to analyze the statistical outcomes in the hopes of discovering trends.

I first downloaded the data as a comma seperated value file, and collected the attributes necessary for an analysis.

Renaming of column attributes:

setnames(tidy_terror, old=c("iyear","country_txt","region_txt", "attacktype1_txt","targtype1_txt","weaptype1_txt"), 
         new=c("Year", "Country","Region","Attack_Type","Target_Type","Weapon"))

Plotting

To get an understanding of what the data is representing in the broadest of scales, lets first display a graph of the number of successful attacks world wide over the entire range of the dataset, which is from 1970-2016. The intiution is that we may be able to find a trend or an interesting observation that could lead to further exploration of the data. In order to do this we can use the ggplot2 package to visualize the data as a plot of points connected by line segments.

tidy_terror <-na.omit(tidy_terror)

plot <- tidy_terror %>%
  group_by(Year) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year)) +
           geom_line(color = "blue") +
    geom_smooth(method =lm)+
  geom_point(size=1.75, colour="#CC0000")  +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Number of Successful Terrorist Attacks Worldwide (1970-2016)",
         y = "# of Successful Attacks",
         x = "Year")+
scale_x_continuous(breaks=seq(1970,2016,by=5)) 
  scale_y_continuous(breaks=seq(0,15000,by=5000))
## <ScaleContinuousPosition>
##  Range:  
##  Limits:    0 --    1
plot

From the plot above its very clear that there seems to be a corelation between the number of successful attacks over the last 46 years. The rise in successful attacks over the last 4 decades could stem from a number of theories. The first being a rise in climate change, which creates a number of scenarios in which terrorists can use valuable resources to gain control over people, this is explimpfied in areas plagued by famine and drought, where water can be control by a terrorist organization. The world has also seen a rise in radical terrorist groups starting in the early 2000s, which could explain why the graph experiences a huge spike. Climate change & terrorism link For information reguarding the ggplot2 package, check out this link!

So now that we have a general idea for the trend, or more appropriately, the rise in terrorism over the last 46 years. Lets try to better visualize the data by displaying the # of succesful attacks by region, to see if any particular regions stick out.

plot <- tidy_terror %>%
  group_by(Year, Region) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color= Region)) +
  geom_point()+
  geom_line()+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Number of Successful Terrorist Attacks By Region (1970-2016)",
         y = "# of Successful Attacks",
         x = "Year")+
scale_x_continuous(breaks=seq(1970,2016,by=5))+
  scale_y_continuous(breaks=seq(0,6500,by=500))
plot

From the new graph we can see that the Middle East & Northern Africa have experienced huge spikes along with South America and Sub-Sahara Africa. Our original hypothesis for a rise in terrrorism due to radicalism could prove valid for the spike in the Middle East & North Africa line, but what can we say for South America and Sub-Sahara Africa? Well for starters, all 3 regions are generally less developed economically when compared to North America/Europe who have experience far less terrorist attacks over the years.

Linear Regression Fit

Since we are more focused on the rise in terrorism we can eliminate the regions that appear to have a stagnat overall growth in successful attacks and instead focus on the top 4 regions: Middle East & North Africa, South Asia, Sub-Saharan Africa and Southeast Asia. We also only care about the recent rise, since most regions have displayed an overall stagnant or slow increase before the year 2000, so lets consider the data from 2000-2016. Fitting each region to a linear regression line will give us a better understanding of the trend of the data.

top_terror <- tidy_terror %>%
  filter(Region == c("Middle East & North Africa","South Asia","Sub-Saharan Africa","Southeast Asia"), Year >= 2000)
  
plot <- top_terror %>%
  group_by(Year, Region) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color= Region)) +
  geom_point()+
  geom_line()+
  geom_smooth(method=lm)+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Number of Successful Terrorist Attacks for the Top 4 Regions (2000-2016)",
         y = "# of Successful Attacks",
         x = "Year")+
scale_x_continuous(breaks=seq(2000,2016,by=2))+
  scale_y_continuous(breaks=seq(0,6500,by=200))
plot

From the linear regression lines we generated for each of the top 4 regions we can see that although each line has a few outliers, the overall slopes are heading in the positive direction over the past 16 years. We can now start to think about whether or not there is a relationship between successful attacks and the type of attack.

Lets look at the relationship between the number of successful attacks over time to the type of attack, sepereated by region and see if we can learn anything about the relationship. We will look at the Middle East & North Africa, Southeastern Asia, South Asia and Sub-Sahara Africa regions for this analysis:

library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.4
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
plot1 <- top_terror %>%
  filter(Region == "Middle East & North Africa") %>%
  group_by(Year,Attack_Type) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
        geom_bar(stat="identity")+
   theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Distribution of Attack Types for the Middle East & North Africa",
         y = "# of Successful Attacks",
         x = "Year")+
     scale_x_continuous(breaks=seq(2000,2016,by=2))

 plot1

plot2 <- top_terror %>%
  filter(Region == "South Asia") %>%
  group_by(Year,Attack_Type) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
        geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Distribution of Attack Types for South Asia",
         y = "# of Successful Attacks",
         x = "Year")+
     scale_x_continuous(breaks=seq(2000,2016,by=2))

  plot2

  plot3 <- top_terror %>%
  filter(Region == "Sub-Saharan Africa") %>%
  group_by(Year,Attack_Type) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
        geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Distribution of Attack Types for Sub-Saharan Africa",
         y = "# of Successful Attacks",
         x = "Year")+
     scale_x_continuous(breaks=seq(2000,2016,by=2))

  plot3

  plot4 <- top_terror %>%
  filter(Region == "Southeast Asia") %>%
  group_by(Year,Attack_Type) %>%
  summarize(success = sum(success)) %>%
  ggplot(mapping=aes(y = success, x = Year, color=Attack_Type, fill = Attack_Type)) +
        geom_bar(stat="identity")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Distribution of Attack Types for Southeast Asia",
         y = "# of Successful Attacks",
         x = "Year")+
     scale_x_continuous(breaks=seq(2000,2016,by=2))


  plot4

From all 4 bar plots its clear that Armed Assult, Assasination and Bombing/Explosion appear in that order, as the top 3 types of attacks that all 4 regions experience. Whats interesting is that each reach seems to follow almost an identitical relationship of increasing for the top 3 attack types is also increaseing at the same rates across each of the 4 regions over the past 16 years.

To get the annual success rate, we should group the data by the year and by the successes. This requires us to create some new attributes to our dataset. We will sum the total successes and the total number of indicents and create a success_rate attribute which divides the total successes by the total number of incidents, which are both also added as attributes.

success_rate_df <- tidy_terror %>%
  group_by(Year) %>%
  summarize(success=sum(success), x=n())%>%
   mutate(total_successes =success, total_attempts = x,success_rate = total_successes/total_attempts) %>%
  select(Year, total_successes, total_attempts, success_rate)

success_rate_df
## # A tibble: 46 x 4
##     Year total_successes total_attempts success_rate
##    <int>           <int>          <int>        <dbl>
##  1  1970             542            643        0.843
##  2  1971             410            461        0.889
##  3  1972             448            492        0.911
##  4  1973             425            465        0.914
##  5  1974             540            576        0.938
##  6  1975             693            727        0.953
##  7  1976             839            900        0.932
##  8  1977            1167           1292        0.903
##  9  1978            1340           1453        0.922
## 10  1979            2280           2529        0.902
## # ... with 36 more rows
plot <- success_rate_df %>%
   ggplot(mapping=aes(y = success_rate, x = Year)) +
  geom_line(color = "blue") +
    geom_smooth(method =lm)+
  geom_point(size=1.75, colour="#CC0000")  +
  geom_smooth(method=lm)+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
   labs(title="Percentage of Successful Terrorist Attack Attempts by Year (1970-2016)",
         y = "% of Successful Attacks",
         x = "Year")+
 scale_x_continuous(breaks=seq(1970,2016,by=5))+
  scale_y_continuous(breaks=seq(0.8,1,by=.02))

plot

The graph does not appear to follow any linear relationship between the percentage of successful attacks overtime. What is interesting is although the rate seems to be decreasing from about 2007 and onward, the ammount of overall successful attacks per year is still rising while the success rate is decreasing. This implies that the total amount of crimes must be increasing at extreme rates to counter the decreasing percentage of successful attacks to keep an overall increasing attack total per year. Performing a linear regression for this specific data would be futile because we see no apparent relationship between the % of sucessful attacks over time.

Interactive Data Maps

While seeing graphical presentations of the data is an effective way to relay infromation about the dataset, we can also take advantage of the leaflet package. This package allows us to represent data dynamically on a map. Using the latitude and longitude points from the dataset we can better visualize which parts in particular are experiencing the highest levels of terrorism. We will accomplish this using the WebGLHeatmap function, which allows us to add a heatmap overlay onto our map using the lat and long points to plot the specific events of an act of terrorism.

library(leaflet)
## Warning: package 'leaflet' was built under R version 3.4.4
library(maps)
## Warning: package 'maps' was built under R version 3.4.4
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(leaflet.extras)
## Warning: package 'leaflet.extras' was built under R version 3.4.4
  worldmap <- leaflet(tidy_terror) %>%
  addTiles() %>%
  addWebGLHeatmap(lng=~longitude, lat=~latitude, size =100000) %>%
  setView(lat=39.29, lng=-76.61, zoom=1.5)

worldmap

The map gives us

Does standardizing winrate for each region affect if winrate will increase or decrease a year from now?

We can address this question by first, filtering to the years of interest and standardize successful attacks % for each region. To train our model we need a table with one row per region, We will do this in stages, first we turn the tidy dataset into a wide dataset using tidyr::spread then create a dataframe containing the differences we use as features. Finally, add the outcome we want to predict from the data frame we created previously. Now split up the data set into training regions and testing regions (using 80/20 random split). We can then make predictions on the held out test set and can make a confusion matrix to calculate error rate.

success_rate_df1 <- tidy_terror %>%
  group_by(Region,Year) %>%
  summarize(success=sum(success), x=n())%>%
   mutate(total_successes =success, total_attempts = x,success_rate = total_successes/total_attempts) %>%
  select(Year,Region, total_successes, total_attempts, success_rate)
success_rate_df1
## # A tibble: 513 x 5
## # Groups:   Region [12]
##     Year Region                total_successes total_attempts success_rate
##    <int> <chr>                           <int>          <int>        <dbl>
##  1  1970 Australasia & Oceania               1              1        1.00 
##  2  1971 Australasia & Oceania               1              1        1.00 
##  3  1972 Australasia & Oceania               2              2        1.00 
##  4  1973 Australasia & Oceania               1              1        1.00 
##  5  1974 Australasia & Oceania               1              1        1.00 
##  6  1978 Australasia & Oceania               2              2        1.00 
##  7  1979 Australasia & Oceania               2              2        1.00 
##  8  1980 Australasia & Oceania               6              7        0.857
##  9  1981 Australasia & Oceania               3              3        1.00 
## 10  1982 Australasia & Oceania               2              2        1.00 
## # ... with 503 more rows
standardized_df1 <- success_rate_df1 %>%
  filter(Year >=2000) %>%
   group_by(Region ) %>%
   mutate(mean_success = mean(success_rate)) %>%
  mutate(sd_success = sd(success_rate)) %>%
  mutate(zScore_success = (success_rate - mean_success) / sd_success) %>%
  ungroup()

 
wide_df1 <- standardized_df1 %>%
  select(Region, Year, zScore_success) %>%
  tidyr::spread(Year, zScore_success)
#wide_df

 
matrix_1a <- wide_df1 %>%
  select(-Region) %>%
  as.matrix() %>%
  .[,-1]

matrix_2a <- wide_df1 %>%
  select(-Region) %>%
  as.matrix() %>%
  .[,-ncol(.)]

diff_df1a <- (matrix_1a - matrix_2a) %>%
  magrittr::set_colnames(NULL) %>%
  as_data_frame() %>%
  mutate(Region = wide_df1$Region)
 

final_df1a <- diff_df1a %>%
  inner_join(tidy_terror %>% select(Region, success), by="Region") %>%
  mutate(success=factor(success, levels=c(1, 0)))
 #final_df


set.seed(1234)
test_random_forest_df1a <- final_df1a %>%
  group_by(success) %>%
  sample_frac(.1) %>%
  ungroup()
test_random_forest_df1a <-na.omit(test_random_forest_df1a)

train_random_forest_df1a <- final_df1a %>%
  anti_join(test_random_forest_df1a, by="Region")

train_random_forest_df1a <-na.omit(train_random_forest_df1a)
train_random_forest_df1a
## # A tibble: 0 x 18
## # ... with 18 variables: V1 <dbl>, V2 <dbl>, V3 <dbl>, V4 <dbl>, V5 <dbl>,
## #   V6 <dbl>, V7 <dbl>, V8 <dbl>, V9 <dbl>, V10 <dbl>, V11 <dbl>,
## #   V12 <dbl>, V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, Region <chr>,
## #   success <fct>
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf1a <- randomForest(success~., data=test_random_forest_df1a %>% select(-Region))
rf1a
## 
## Call:
##  randomForest(formula = success ~ ., data = test_random_forest_df1a %>%      select(-Region)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 10.86%
## Confusion matrix:
##       1 0 class.error
## 1 13892 0           0
## 0  1692 0           1
test_predictions1a <- predict(rf1a, newdata=train_random_forest_df1a %>% select(-Region))
table(pred=test_predictions1a, observed=train_random_forest_df1a$success)
##     observed
## pred 1 0
##    1 0 0
##    0 0 0

Lets take a t-statistic of the linear fit for success-rate and year to see if there is a linear dependent. We can set our null hypothesis as no relationship between success-rate vs time. From the statistics we would not reject the null hypothesis because from the t-statistics we can see that our probability is 0.206 which is rather large for a pvalue, thus we can imply the probability of the null hypothesis being true is likely which is why we would accept the null hypothesis. Our results also match our intuition about the graph of the relationship above, which also exhibited no appareant linear relationships.

set.seed(1234)

# create a linear regression model for the data
 
regression_fit = lm(success_rate~Year, data = standardized_df1,family = binomial)
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##  extra argument 'family' will be disregarded
regression_fit %>% 
  tidy()  
##          term     estimate   std.error statistic   p.value
## 1 (Intercept)  5.384201734 4.246757723  1.267838 0.2063598
## 2        Year -0.002253677 0.002114907 -1.065615 0.2879086

For more information about logistical regression visit this link!

Summary

This tutorial displayed some of the many features that we can utilize in R to represent the full data science pipeline. Many different approaches can be made as far as handling, displaying and analyzing the data, which makes R a great language for data science. Overall, the dataset for terrorist attacks provided an opportunity to learn some interesting information based off of a few simple techniques that data science employs all the time!